Perspectives on validation of clinical predictive algorithms

The generalizability of predictive algorithms is of key relevance to their application in clinical practice. Based on the existing literature, we provide an overview of three types of generalizability: temporal, geographical, and domain generalizability. Each type is linked to its associated goals, methodology, and stakeholders.

The 'generalizability of the model' is too vague in the otherwise excellent best practices by Kakarmath et al.

Regulatory body: "Did you put in place verification and validation methods and documentation (e.g., logging) to evaluate and ensure different aspects of the AI system's reliability and reproducibility? Did you clearly document and operationalize processes for the testing and verification of the reliability and reproducibility of the AI system?" [3]
In this document by the High-Level Expert Group on Artificial Intelligence, 'reproducibility' may refer to internal validity, but this could be made explicit. 'Validation methods' is too vague: it is not clear whether this refers only to internal validity or also extends to external validity, for example temporal validity.
Suggestion: 'Did you put in place internal validation methods and documentation (e.g., logging) to evaluate and ensure different aspects of the AI system's reliability and reproducibility? Did you clearly document and operationalize processes for the testing and verification of the internal validity (and possibly temporal validity) of the AI system?'

Regulatory body: "The product must then be validated which usually involves being tested in a setting that represents the intended population and/or environment." [4]
The UK Department for Health and Social Care correctly states that the validation should be aligned with the intended use of the predictive algorithm. This could be improved further by phrasing it more strictly as a best practice.
Suggestion: 'The product must then be validated in a setting that represents the intended operational period, (clinical) population and environment.'

Regulatory body: "Analytical validation confirms and provides objective evidence that the software was correctly constructed, namely, correctly and reliably processes input data and generates output data with the appropriate level of accuracy, and repeatability and reproducibility ... Analytical validation is necessary for any SaMD." [5]
It is unclear what is meant by 'analytical validation' in this FDA document; from the context, it appears to refer to internal validation ('reproducibility').
Suggestion: 'Internal validation confirms and provides objective evidence that the software was correctly constructed, namely, correctly and reliably processes input data and generates output data with the appropriate level of accuracy, and repeatability and reproducibility ... Internal validation is necessary for any SaMD.'

Internal validity
The 4C Mortality Score was developed in 260 hospitals across the United Kingdom, and its L1-penalized coefficients were derived using 10-fold cross-validation [8].
A machine learning model for predicting readmission or death within 7 days after ICU discharge was developed in one academic medical center in the Netherlands using 10-fold cross-validation [9].
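To make the procedure concrete, below is a minimal sketch of internal validation by 10-fold cross-validation of an L1-penalized logistic regression, in the spirit of the two examples above. It uses scikit-learn on synthetic data; the cohort, features, and penalty strength are illustrative assumptions, not the published models.

```python
# Minimal sketch: internal validation via 10-fold cross-validation.
# Synthetic data only; features and penalty strength are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a single development cohort.
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)

# L1-penalized logistic regression, analogous in spirit to the
# penalized coefficients of the 4C Mortality Score.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
)

# 10-fold cross-validation estimates performance on resamples of the
# same development data, i.e., internal validity only.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
aucs = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"Cross-validated AUC: {aucs.mean():.3f} (SD {aucs.std():.3f})")
```

Note that the resulting estimate says nothing about performance in a later period or at another site; that is what the external validations below assess.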

Temporal generalizability
The 4C Mortality Score was developed on a cohort recruited between 6 February and 20 May 2020 and temporally validated on a cohort recruited between 21 May and 29 June 2020 [8].
A machine learning model for predicting readmission or death within 7 days after ICU discharge was developed on a cohort recruited between 2004 and March 2016 and temporally validated on a cohort recruited between March 2016 and 2019 [9].
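In code, a temporal validation reduces to a chronological split: fit on the early recruitment window and evaluate, without refitting, on the later one. The sketch below assumes pandas and scikit-learn; the admission_date column, the two predictors, and the 21 May 2020 cutoff (borrowed from the 4C windows above) are illustrative.

```python
# Minimal sketch: temporal validation via a chronological split.
# Synthetic data; column names, predictors, and dates are illustrative.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 3000
df = pd.DataFrame({
    "admission_date": pd.to_datetime("2020-02-06")
                      + pd.to_timedelta(rng.integers(0, 144, n), unit="D"),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
df["died"] = (rng.random(n) < 1 / (1 + np.exp(-(df.x1 - df.x2)))).astype(int)

# Split chronologically, mirroring a development/validation window design.
cutoff = pd.Timestamp("2020-05-21")
dev = df[df.admission_date < cutoff]
val = df[df.admission_date >= cutoff]

# Fit on the early window, score the later window without refitting.
model = LogisticRegression().fit(dev[["x1", "x2"]], dev["died"])
auc = roc_auc_score(val["died"], model.predict_proba(val[["x1", "x2"]])[:, 1])
print(f"Temporal validation AUC: {auc:.3f}")
```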

Geographical generalizability
The 4C Mortality Score was externally validated in Canadian and Japanese cohorts with different geography from the development cohort (United Kingdom) [10,11].
A sepsis prediction model was externally validated in an academic health system with different geography from the development cohort [12].
A machine learning model for predicting readmission or death within 7 days after ICU discharge was externally validated in an academic medical center (Leiden University Medical Center) with different geography from the development center (Amsterdam UMC, Location VUmc) [13].
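Computationally, a geographical validation is the simplest of the three: the model fitted at the development site is frozen and scored on the external site's data, reporting discrimination and calibration-in-the-large. The sketch below simulates two sites whose case mix differs; all names and the simulated shift are assumptions for illustration.

```python
# Minimal sketch: geographical (external) validation of a frozen model.
# Synthetic data; the two "sites" and the case-mix shift are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def simulate_site(n, shift):
    """Simulate one site's cohort; `shift` mimics case-mix differences."""
    X = rng.normal(loc=shift, size=(n, 4))
    y = (rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)
    return X, y

X_dev, y_dev = simulate_site(2000, shift=0.0)  # development site
X_ext, y_ext = simulate_site(1000, shift=0.5)  # external site, shifted case mix

# Fit once at the development site; do not refit on external data.
model = LogisticRegression().fit(X_dev, y_dev)

# External validation: discrimination and calibration-in-the-large.
p_ext = model.predict_proba(X_ext)[:, 1]
print(f"External AUC: {roc_auc_score(y_ext, p_ext):.3f}")
print(f"Observed vs mean predicted risk: {y_ext.mean():.3f} vs {p_ext.mean():.3f}")
```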

Domain generalizability
An emergency medicine admissions model was developed in two urban teaching centers and one tertiary care center in the Netherlands. It was validated using leave-one-group-out cross-validation, in which each center formed one group, to address the heterogeneity between the different hospital types [14].
Three ED disposition prediction models were developed: two in children's hospital EDs and one in a community general hospital's ED. Each model was then validated at the other two sites [15].
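The leave-one-group-out design described above maps directly onto scikit-learn's LeaveOneGroupOut splitter, with each hospital as one group. Below is a minimal sketch on synthetic data, where three simulated hospitals merely stand in for the center types in these studies.

```python
# Minimal sketch: leave-one-group-out cross-validation across sites.
# Synthetic data; each "group" stands in for one hospital.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(2)
n = 3000
X = rng.normal(size=(n, 5))
y = (rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)
hospital = rng.integers(0, 3, size=n)  # 3 sites, e.g., 2 teaching + 1 tertiary

# Each iteration trains on two hospitals and validates on the third,
# probing heterogeneity between center types.
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=hospital):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    p = model.predict_proba(X[test_idx])[:, 1]
    held_out = hospital[test_idx][0]
    print(f"Held-out hospital {held_out}: AUC = {roc_auc_score(y[test_idx], p):.3f}")
```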